OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Text and Non-text Segmentation based on Connected Component Features

Internal identifier: 000017 (Main/Exploration); previous: 000016; next: 000018

Authors: Viet Phuong Le [France]; Nibal Nayef [France]; Muriel Visani [France]; Jean-Marc Ogier [France]; De Cao Tran [Vietnam]

RBID : Hal:hal-01319903

Abstract

Document image segmentation is crucial to OCR and other digitization processes. In this paper, we present a learning-based approach for text and non-text separation in document images. The training features are extracted at the level of connected components, an intermediate level between the slow, noise-sensitive pixel level and the segmentation-dependent zone level. Given all types, shapes and sizes of connected components, we extract a powerful set of features based on the size, shape, stroke width and position of each connected component. AdaBoost with decision trees is used to label connected components. Finally, the classification of connected components into text and non-text is corrected based on classification probabilities and on size and stroke width analysis of the nearest neighbors of a connected component. The performance of our approach has been evaluated on two standard datasets: UW-III and the dataset of the ICDAR-2009 competition on document layout analysis. Our results demonstrate that the proposed approach achieves competitive performance for segmenting text and non-text in document images of variable content and degradation.
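
The idea of extracting features at the connected-component level can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy image, and the chosen features (bounding-box size, aspect ratio, pixel density, normalized position) are assumptions made for illustration only, and the stroke-width features, the boosted decision-tree classifier, and the neighbor-based correction described in the abstract are not reproduced here.

```python
from collections import deque


def connected_components(image):
    """Group 8-connected foreground pixels (value 1) into components."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def component_features(pixels, rows, cols):
    """Compute simple size, shape and position features for one component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    area = len(pixels)
    return {
        "width": width,
        "height": height,
        "area": area,
        "aspect_ratio": width / height,
        "density": area / (width * height),       # fill ratio of the bounding box
        "center_x": (min(xs) + max(xs)) / 2 / cols,  # position normalized by page size
        "center_y": (min(ys) + max(ys)) / 2 / rows,
    }


# Toy binary image (hypothetical data): a 2x2 block and an L-shaped stroke.
IMAGE = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1],
]

FEATURES = [component_features(p, len(IMAGE), len(IMAGE[0]))
            for p in connected_components(IMAGE)]
```

In a pipeline of the kind the abstract describes, feature vectors like these would be passed to a trained classifier that labels each component as text or non-text.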

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Text and Non-text Segmentation based on Connected Component Features</title>
<author>
<name sortKey="Le, Viet Phuong" sort="Le, Viet Phuong" uniqKey="Le V" first="Viet Phuong" last="Le">Viet Phuong Le</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Nayef, Nibal" sort="Nayef, Nibal" uniqKey="Nayef N" first="Nibal" last="Nayef">Nibal Nayef</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Visani, Muriel" sort="Visani, Muriel" uniqKey="Visani M" first="Muriel" last="Visani">Muriel Visani</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Ogier, Jean Marc" sort="Ogier, Jean Marc" uniqKey="Ogier J" first="Jean-Marc" last="Ogier">Jean-Marc Ogier</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Cao Tran, De" sort="Cao Tran, De" uniqKey="Cao Tran D" first="De" last="Cao Tran">De Cao Tran</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-90277" status="VALID">
<orgName>Can Tho University</orgName>
<desc>
<address>
<addrLine>3-2 Street, Cantho City</addrLine>
<country key="VN"></country>
</address>
<ref type="url">http://www.ctu.edu.vn/en/</ref>
</desc>
<listRelation>
<relation active="#struct-367763" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-367763" type="direct">
<org type="institution" xml:id="struct-367763" status="INCOMING">
<orgName>Université de Cantho</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Viêt Nam</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01319903</idno>
<idno type="halId">hal-01319903</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01319903</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01319903</idno>
<date when="2015-08-23">2015-08-23</date>
<idno type="wicri:Area/Hal/Corpus">000119</idno>
<idno type="wicri:Area/Hal/Curation">000119</idno>
<idno type="wicri:Area/Hal/Checkpoint">000005</idno>
<idno type="wicri:Area/Main/Merge">000017</idno>
<idno type="wicri:Area/Main/Curation">000017</idno>
<idno type="wicri:Area/Main/Exploration">000017</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Text and Non-text Segmentation based on Connected Component Features</title>
<author>
<name sortKey="Le, Viet Phuong" sort="Le, Viet Phuong" uniqKey="Le V" first="Viet Phuong" last="Le">Viet Phuong Le</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Nayef, Nibal" sort="Nayef, Nibal" uniqKey="Nayef N" first="Nibal" last="Nayef">Nibal Nayef</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Visani, Muriel" sort="Visani, Muriel" uniqKey="Visani M" first="Muriel" last="Visani">Muriel Visani</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Ogier, Jean Marc" sort="Ogier, Jean Marc" uniqKey="Ogier J" first="Jean-Marc" last="Ogier">Jean-Marc Ogier</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-40831" status="VALID">
<orgName>Laboratoire Informatique, Image et Interaction</orgName>
<orgName type="acronym">L3I</orgName>
<desc>
<address>
<addrLine>Bâtiment Pascal Avenue Michel Crépeau F-17042 La Rochelle Cedex 1</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-lr.fr/l3i</ref>
</desc>
<listRelation>
<relation name="EA2118" active="#struct-300311" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="EA2118" active="#struct-300311" type="direct">
<org type="institution" xml:id="struct-300311" status="VALID">
<orgName>Université de La Rochelle</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">La Rochelle</settlement>
<region type="region" nuts="2">Poitou-Charentes</region>
</placeName>
<orgName type="university">Université de La Rochelle</orgName>
</affiliation>
</author>
<author>
<name sortKey="Cao Tran, De" sort="Cao Tran, De" uniqKey="Cao Tran D" first="De" last="Cao Tran">De Cao Tran</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-90277" status="VALID">
<orgName>Can Tho University</orgName>
<desc>
<address>
<addrLine>3-2 Street, Cantho City</addrLine>
<country key="VN"></country>
</address>
<ref type="url">http://www.ctu.edu.vn/en/</ref>
</desc>
<listRelation>
<relation active="#struct-367763" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-367763" type="direct">
<org type="institution" xml:id="struct-367763" status="INCOMING">
<orgName>Université de Cantho</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Viêt Nam</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Document image segmentation is crucial to OCR and other digitization processes. In this paper, we present a learning-based approach for text and non-text separation in document images. The training features are extracted at the level of connected components, a mid-level between the slow noise-sensitive pixel level, and the segmentation-dependent zone level. Given all types, shapes and sizes of connected components, we extract a powerful set of features based on size, shape, stroke width and position of each connected component. Adaboosting with Decision trees is used for labeling connected components. Finally, the classification of connected components into text and non-text is corrected based on classification probabilities and size as well as stroke width analysis of the nearest neighbors of a connected component. The performance of our approach has been evaluated on the two standard datasets: UW-III and ICDAR-2009 competition for document layout analysis. Our results demonstrate that the proposed approach achieves competitive performance for segmenting text and non-text in document images of variable content and degradation.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
<li>Viêt Nam</li>
</country>
<region>
<li>Poitou-Charentes</li>
</region>
<settlement>
<li>La Rochelle</li>
</settlement>
<orgName>
<li>Université de La Rochelle</li>
</orgName>
</list>
<tree>
<country name="France">
<region name="Poitou-Charentes">
<name sortKey="Le, Viet Phuong" sort="Le, Viet Phuong" uniqKey="Le V" first="Viet Phuong" last="Le">Viet Phuong Le</name>
</region>
<name sortKey="Nayef, Nibal" sort="Nayef, Nibal" uniqKey="Nayef N" first="Nibal" last="Nayef">Nibal Nayef</name>
<name sortKey="Ogier, Jean Marc" sort="Ogier, Jean Marc" uniqKey="Ogier J" first="Jean-Marc" last="Ogier">Jean-Marc Ogier</name>
<name sortKey="Visani, Muriel" sort="Visani, Muriel" uniqKey="Visani M" first="Muriel" last="Visani">Muriel Visani</name>
</country>
<country name="Viêt Nam">
<noRegion>
<name sortKey="Cao Tran, De" sort="Cao Tran, De" uniqKey="Cao Tran D" first="De" last="Cao Tran">De Cao Tran</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
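
As displayed, the record uses the `wicri:` and `hal:` prefixes without declaring them, so a namespace-aware XML parser will reject it as it stands. The sketch below shows one possible way to read such a record with Python's standard library: it wraps the text in an element that declares the prefixes. The namespace URIs, the `wrapper` element, the function names, and the shortened sample are placeholders assumed for illustration; they are not part of the record or of Dilib.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder URIs: the record itself does not declare these prefixes.
NAMESPACES = {
    "wicri": "urn:example:wicri",
    "hal": "urn:example:hal",
}


def parse_record(record_xml):
    """Parse a <record> whose wicri:/hal: prefixes are undeclared."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in NAMESPACES.items())
    # Declaring the prefixes on an enclosing element makes them valid inside it.
    wrapped = f"<wrapper {decls}>{record_xml}</wrapper>"
    return ET.fromstring(wrapped).find("record")


def author_names(record):
    """Return the display names of the authors listed in titleStmt."""
    return [name.text for name in record.iterfind(".//titleStmt/author/name")]


# Shortened excerpt of the record above, kept only to make the sketch runnable.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">Text and Non-text Segmentation based on Connected Component Features</title>
<author><name sortKey="Le, Viet Phuong">Viet Phuong Le</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory"><orgName>Laboratoire Informatique, Image et Interaction</orgName></hal:affiliation><country>France</country></affiliation>
</author>
<author><name sortKey="Cao Tran, De">De Cao Tran</name>
<affiliation wicri:level="1"><country>Viêt Nam</country></affiliation>
</author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""

RECORD = parse_record(SAMPLE)
AUTHORS = author_names(RECORD)
# Unprefixed <affiliation> elements only; hal:affiliation lives in another namespace.
COUNTRIES = [c.text for c in RECORD.iterfind(".//affiliation/country")]
```

The same approach should apply to the full record, since the wrapper only adds declarations and leaves the record's own elements and attributes untouched.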

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000017 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000017 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-01319903
   |texte=   Text and Non-text Segmentation based on Connected Component Features
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024